{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text sequences to TFRecords\n",
    "----\n",
    "\n",
    "Hello everyone! In this tutorial, I am going to show you how can you parse your raw text data to TFRecords. I know that many people struggle with input processing pipelines, especially when you start working on your own personal project. So I really hope it is going to be useful for any of you :)!\n",
    "\n",
    "### Tutorial flowchart\n",
    "----\n",
    "![img](tutorials_graphics/text2tfrecords.png)\n",
    "\n",
    "\n",
    "### Dummy IMDB text data\n",
    "----\n",
    "For practice, I have chosen a few data samples from the Large Movie Review Dataset offered by Stanford."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import here useful libraries\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "import tensorflow as tf\n",
    "import pandas as pd\n",
    "import pickle\n",
    "import random\n",
    "import glob\n",
    "import nltk\n",
    "import re\n",
    "\n",
    "try:\n",
    "    nltk.data.find('tokenizers/punkt')\n",
    "except LookupError:\n",
    "    nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse data to TFRecords\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def imdb2tfrecords(path_data='datasets/dummy_text/', min_word_frequency=5,\n",
    "                   max_words_review=700):\n",
    "    '''\n",
    "    This script processes the data and saves it in the default TensorFlow \n",
    "    file format: tfrecords.\n",
    "    \n",
    "    Args:\n",
    "        path_data: the path where the imdb data is stored.\n",
    "        min_word_frequency: the minimum frequency of a word, to keep it\n",
    "                            in the vocabulary.\n",
    "        max_words_review: the maximum number of words allowed in a review.\n",
    "    '''\n",
    "    # Get the filenames of the positive/negative reviews \n",
    "    pos_files = glob.glob(path_data + 'pos/*')\n",
    "    neg_files = glob.glob(path_data + 'neg/*')\n",
    "\n",
    "    # Concatenate both positive and negative reviews filenames\n",
    "    filenames = pos_files + neg_files\n",
    "    \n",
    "    # List with all the reviews in the dataset\n",
    "    reviews = [open(filenames[i],'r').read() for i in range(len(filenames))]\n",
    "    \n",
    "    # Remove HTML tags\n",
    "    reviews = [re.sub(r'<[^>]+>', ' ', review) for review in reviews]\n",
    "        \n",
    "    # Tokenize each review in part\n",
    "    reviews = [word_tokenize(review) for review in reviews]\n",
    "    \n",
    "    # Compute the length of each review\n",
    "    len_reviews = [len(review) for review in reviews]\n",
    "\n",
    "    # Flatten nested list\n",
    "    reviews = [word for review in reviews for word in review]\n",
    "    \n",
    "    # Compute the frequency of each word\n",
    "    word_frequency = pd.value_counts(reviews)\n",
    "    \n",
    "    # Keep only words with frequency higher than minimum\n",
    "    vocabulary = word_frequency[word_frequency>=min_word_frequency].index.tolist()\n",
    "    \n",
    "    # Add Unknown, Start and End token. \n",
    "    extra_tokens = ['Unknown_token', 'End_token']\n",
    "    vocabulary += extra_tokens\n",
    "    \n",
    "    # Create a word2idx dictionary\n",
    "    word2idx = {vocabulary[i]: i for i in range(len(vocabulary))}\n",
    "    \n",
    "    # Write word vocabulary to disk\n",
    "    pickle.dump(word2idx, open(path_data + 'word2idx.pkl', 'wb'))\n",
    "        \n",
    "    def text2tfrecords(filenames, writer, vocabulary, word2idx,\n",
    "                       max_words_review):\n",
    "        '''\n",
    "        Function to parse each review in part and write to disk\n",
    "        as a tfrecord.\n",
    "        \n",
    "        Args:\n",
    "            filenames: the paths of the review files.\n",
    "            writer: the writer object for tfrecords.\n",
    "            vocabulary: list with all the words included in the vocabulary.\n",
    "            word2idx: dictionary of words and their corresponding indexes.\n",
    "        '''\n",
    "        # Shuffle filenames\n",
    "        random.shuffle(filenames)\n",
    "        for filename in filenames:\n",
    "            review = open(filename, 'r').read()\n",
    "            review = re.sub(r'<[^>]+>', ' ', review)\n",
    "            review = word_tokenize(review)\n",
    "            # Reduce review to max words\n",
    "            review = review[-max_words_review:]\n",
    "            # Replace words with their equivalent index from word2idx\n",
    "            review = [word2idx[word] if word in vocabulary else \n",
    "                      word2idx['Unknown_token'] for word in review]\n",
    "            indexed_review = review + [word2idx['End_token']]\n",
    "            sequence_length = len(indexed_review)\n",
    "            target = 1 if filename.split('/')[-2]=='pos' else 0\n",
    "            # Create a Sequence Example to store our data in\n",
    "            ex = tf.train.SequenceExample()\n",
    "            # Add non-sequential features to our example\n",
    "            ex.context.feature['sequence_length'].int64_list.value.append(sequence_length)\n",
    "            ex.context.feature['target'].int64_list.value.append(target)\n",
    "            # Add sequential feature\n",
    "            token_indexes = ex.feature_lists.feature_list['token_indexes']\n",
    "            for token_index in indexed_review:\n",
    "                token_indexes.feature.add().int64_list.value.append(token_index)\n",
    "            writer.write(ex.SerializeToString())\n",
    "    \n",
    "    ##########################################################################     \n",
    "    # Write data to tfrecords.This might take a while.\n",
    "    ##########################################################################\n",
    "    writer = tf.python_io.TFRecordWriter(path_data + 'dummy.tfrecords')\n",
    "    text2tfrecords(filenames, writer, vocabulary, word2idx, \n",
    "                   max_words_review)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "imdb2tfrecords(path_data='datasets/dummy_text/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse TFRecords to TF tensors\n",
    "----"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_imdb_sequence(record):\n",
    "    '''\n",
    "    Script to parse imdb tfrecords.\n",
    "    \n",
    "    Returns:\n",
    "        token_indexes: sequence of token indexes present in the review.\n",
    "        target: the target of the movie review.\n",
    "        sequence_length: the length of the sequence.\n",
    "    '''\n",
    "    context_features = {\n",
    "        'sequence_length': tf.FixedLenFeature([], dtype=tf.int64),\n",
    "        'target': tf.FixedLenFeature([], dtype=tf.int64),\n",
    "        }\n",
    "    sequence_features = {\n",
    "        'token_indexes': tf.FixedLenSequenceFeature([], dtype=tf.int64),\n",
    "        }\n",
    "    context_parsed, sequence_parsed = tf.parse_single_sequence_example(record, \n",
    "        context_features=context_features, sequence_features=sequence_features)\n",
    "        \n",
    "    return (sequence_parsed['token_indexes'], context_parsed['target'],\n",
    "            context_parsed['sequence_length'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want me to add anything to this tutorial, please let me know and I will be happy to further enhance it :)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
